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Often the goal of model selection is to choose a model for future 

prediction, and it is natural to measure the accuracy of a future 

prediction by squared error loss. Under the Bayesian approach, it is 

commonly perceived that the optimal predictive model is the model 

with highest posterior probability, but this is not necessarily the case. 

^H , In this paper we show that, for selection among normal linear models, 

^^ ' the optimal predictive model is often the median probability model, 

which is defined as the model consisting of those variables which have 

overall posterior probability greater than or equal to 1/2 of being in 

Cy ' a model. The median probability model often differs from the highest 

probability model. 






^ 



1. Introduction. Consider the usual normal linear model 
>, 
^. (1) y = X/3 + e, 

\o. 

•r^ • where y is the n x 1 vector of observed values of the response variable, 

^O , X is the n X k {k < n) full rank design matrix of covariates, and /9 is a 

zi I k X 1 vector of unknown coefficients. We assume that the coordinates of the 

(^ ■ random error vector e are independent, each with a normal distribution with 

mean and common variance a'^ that can be known or unknown. The least 
squares estimator for this model is thus (3 = (X'X)~^X'y. 
rH . Equation (1) will be called the full model, and we consider selection from 

" ' among submodels of the form 






X 



(2) Mi:y = Xi/3i + e, 

^ I where 1 = (^i , Z2, • • • , h) is the model index, /j being either 1 or as covariate 

Xi is in or out of the model (or, equivalently, if (3i is set equal to zero); Xi 
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contains the columns of X corresponding to the nonzero coordinates of 1; 
and (3i is the corresponding vector of regression coefficients. 

Upon selecting a model, it will be used to predict a future observation 

(3) y*=x*/3 + e, 

where x* = {x^, . . . , x^) is the vector of covariates at which the prediction is 
to be performed. The loss in predicting y* hy y* will be assumed to be the 
squared error loss 

(4) Lir,y*) = ir-y*f- 

With the Bayesian approach to model selection it is commonly perceived 
that the best model will be that with the highest posterior probability. This 
is true under very general conditions if only two models are being enter- 
tained [see Berger (1997)] and is often true in the variable selection problem 
for linear models having orthogonal design matrices [cf. Clyde (1999) and 
Clyde and George (1999, 2000)], but is not generally true. Indeed, even when 
only three models are being entertained essentially nothing can be said about 
which model is best if one knows only the posterior probabilities of the mod- 
els. This is demonstrated in Section 5, based on a geometric representation 
of the problem. 

For prediction of a single y* at a specific x*, one can, of course, sim- 
ply compute the posterior expected predictive loss corresponding to each 
model and choose the model that minimizes this expected loss. In such a 
scenario, however, choosing a specific model makes little sense; one should, 
rather, base the prediction on Bayesian model averaging [cf. Clyde (1999) and 
Hoeting, Madigan, Raftery and Voliksky (1999)]. The basic use of model se- 
lection for prediction is when, because of outside constraints, a single model 
must be selected for repeated use in future predictions. (Note that we are 
assuming that these constraints preclude use of the Bayesian model aver- 
aging answer.) It is natural to assume that these future predictions will be 
made for covariates x* that arise according to some distribution. We further 
assume that the kx k expectation matrix corresponding to this distribution, 

(5) Q = E[(x*)'(x*)], 

exists and is positive definite. A frequent choice is Q = X'X, which is equiv- 
alent to assuming that the covariates that will occur in the future are like 
those that occurred in the data. (Strictly, this would yield Q = -X'X, but 
constants that are common across models can be ignored.) 

In this scenario, one could still simply compute the expected predictive 
loss corresponding to each model and minimize, but the expectation would 
now also be over x*. This can add quite a computational burden, especially 
when there are many models to consider. Bayesian MCMC schemes have 
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been developed that can effectively determine the posterior model probabil- 
ities P(Mi|y), but adding an expectation over x* and a minimization over 1 
can be prohibitive [although see Miiller (1999)]. We thus sought to determine 
if there are situations in which it is possible to give the optimal predictive 
model solely in terms of the posterior model probabilities. 

Rather general characterizations of the optimal model turn out to be 
frequently available but, quite surprisingly, the characterizations are not in 
terms of the highest posterior probability model, but rather in terms of what 
we call the median probability model. 

Definition 1. The posterior inclusion probability for variable i is 

(6) Pi= E ^(^^ily)' 

that is, the overall posterior probability that variable i is in the model. 

If it exists, the median probability model Mj* is defined to be the model 
consisting of those variables whose posterior inclusion probability is at least 
1/2. Formally, 1* is defined coordinatewise by 

(7) Z* = i ■*■' ^^ P^ - 2' 

' \0, otherwise. 

It may happen that the set of covariates defined by (7) does not correspond 
to a model under consideration, in which case the median probability model 
will not exist. There are, however, two important cases in which the median 
probability model is assured to exist. The first is in the problem of variable 
selection when any variable can be included or excluded from the model (so 
that all vectors I are possible). 

The second case of particular interest is when the class of models under 
consideration has a graphical model structure. 

Definition 2. Suppose that for each variable index i there is a corre- 
sponding index set I{i) of other variables. A subclass of linear models has 
graphical model structure if it consists of all models satisfying the condition 
"for each i, if variable Xi is in the model, then variables Xj with j G I{i) are 
in the model." 

It is straightforward to show that if a subclass of linear models has graph- 
ical model structure, then the median probability model will satisfy the con- 
dition in the definition and, hence, will itself be a member of the subclass. 

One common example of a class of linear models having graphical model 
structure is the class of all models that can be constructed from certain main 
effects and their interactions up to a certain order, subject to the condition 
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that if a high order interaction of variables is in a model then all lower order 
interactions (and main effects) of the variables must be in the model. 

A second example of a subclass having graphical model structure is a 
sequence of nested models, 

(8) Mi(,), j = 0,...,A;, wherel(j) = (l,...,l,0,...,0), 

with j ones and k — j zeroes. Examples of this scenario include polynomial 
regression, in which j refers to the polynomial order used, and autoregressive 
time series, in which j refers to the allowed lag. Note that for nested models 
the median probability model has a simpler representation as Muj*-j, where 
j* is such that 

i*-i i* 

(9) ^P(M,(,)|y)<i and 5]P(Mi(,)|y) > i. 
i=0 i=0 

In other words, one just lists the sequence of posterior model probabilities 
and sums them up until the sum exceeds 1/2. The model at which the 
exceedance occurs is the median probability model. 

The above special cases also define the scenarios that will be investigated 
in this paper. The goal will be to provide conditions under which the median 
probability model is the optimal predictive model. The conditions are pri- 
marily restrictions on the form of the predictors for y* . The restrictions are 
fairly severe, so that the results can best be thought of as applying primarily 
to default Bayes or empirical Bayes types of procedures. 

Initially we had sought to find conditions under which the highest poste- 
rior probability model was the optimal predictive model. It came as quite a 
surprise to find that any optimality theorems we could obtain were, instead, 
for the median probability model. Frequently, however, the median proba- 
bility model will coincide with the highest posterior probability model. One 
obvious situation in which this is the case is when there is a model with 
posterior probability greater than 1/2. Indeed, when the highest posterior 
probability model has substantially larger probability than the other mod- 
els, it will typically also be the median probability model. Another situation 
in which the two coincide is when 

k 

(10) P{M,\y) = f[p^{l-p,)'^'-'^\ 

i=l 

where the pi are the posterior inclusion probabilities in (6). This will be seen 
to occur in the problem of variable selection under an orthogonal design 
matrix, certain prior structures, and known variance o"^, as in George and 
McCulloch (1993). [Clyde and Parmigiani (1996) and Clyde, DeSimone and 
Parmigiani (1996) show that (10) can often be approximately satisfied when 
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a"^ is unknown, and it is likely that the median probability model will equal 
the maximum probability model in such cases.] 

That the median probability model is optimal (under restrictions) for 
both the variable selection problem and the nested case, which are very 
different in nature, suggests that it might quite generally be the optimal 
predictive model and should replace the highest posterior probability model 
as the "preferred" predictive model in practice. (We will see evidence of 
this later.) Note also that determination of the median probability model 
is very straightforward within ordinary MCMC model search schemes. In 
these schemes one develops a Markov chain to move between the models, 
with the posterior probability of a model being estimated by the fraction of 
the time that the model is visited by the chain. To determine the median 
probability model one need only record, for each variable, the fraction of 
the time that the variable is present in the visited models; at the end of 
the MCMC one chooses those variables for which the fraction exceeds 1/2. 
Indeed, determining the median probability model in this fashion will often 
be computationally simpler than finding the highest posterior probability 
model. In the variable selection problem, for instance, accurately determin- 
ing when k fractions are above or below 1/2 is often much easier than trying 
to accurately estimate the fractional visiting times of 2^ models. Note also 
that in the orthogonal design situation mentioned above the posterior inclu- 
sion probabilities are actually available in closed form. 

The difference between predictive optimality and highest posterior model 
probability also explains several misunderstandings that have arisen out 
of the literature. For instance, Shibata (1983) shows that the BIC model 
selection criterion is asymptotically inferior to AIC for prediction in sce- 
narios such as polynomial regression, when the true regression function is 
not a polynomial. This has been routinely misinterpreted as saying that the 
Bayesian approach to model selection is fine if the true model is among those 
being considered, but is inferior if the true model is outside the set of candi- 
date models. Note, however, that BIC is essentially just an approximation to 
the log posterior probability of a model, so that model selection according 
to BIC is (at best) just selecting the highest posterior probability model, 
which is often not the optimal Bayesian answer. Indeed, as discussed above, 
the optimal Bayesian predictive model in the situation of Shibata (1983) is 
actually the median probability model. [There are also concerns with the 
applicability of BIC as an approximation to log posterior probability here; 
see Berger, Ghosh and Mukhopadhyay (2003) for further discussion.] 

In Section 2, we set the basic notation for the prediction problem and 
give the formula for predictive expected loss. Section 3 gives the basic the- 
ory concerning optimality of the median probability model, and discusses 
application in nested model and ANOVA scenarios. Section 4 generalizes 
the basic theory to deal with problems in which all models have common 
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nuisance parameters and the design matrix is nonorthogonal. A geometric 
description of the problem is provided in Section 5; this provides consider- 
able insight into the structure of the problem. Finally, Section 6 gives some 
concluding comments, primarily relating to the limitations of the theory. 

2. Preliminaries. 

2.1. Posterior inputs to the prediction problem. Information from the 
data and prior is summarized by providing, for all 1, 

p\ = P(Mi|y) the posterior probability of model Mi, 

(11) 7ri(/3i,cr|y) the posterior distribution 

of the unknown parameters in My. 

These inputs will often arise from a pure Bayesian analysis based on initial 
specification of prior probabilities P{Mi) for the models, together with prior 
distributions 7ri(/3i,cr) for the corresponding unknown parameters. Then, 
given the data y, the posterior probability of Mi is given by 



P(Mi)mi(y) 
El*P(Mi.)mp(y)' 



(12) PI 

where 

(13) mi(y) =y7ri(/3„f7)/i(y|/3i,cT)d/3,da 

is the marginal density of y under Mi, with /i(y |/3j, a) denoting the normal 
density specified by Mi. Likewise, the posterior distributions -ki{(3i, a\y) are 
given by straightforward application of Bayes theorem within each model. 

We allow, however, for nontraditional determination of the pi and 7ri(/3i, cr|y), 
as can occur with use of default strategies. In particular, it is not uncommon 
to use separate methodologies to arrive at the pi and the 7ri(/3i,cr|y), the pi 
being determined through use of a default model selection tool such as BIC, 
Intrinsic Bayes Factors [cf. Berger and Pericchi (1996a)] or Fractional Bayes 
Factors [cf. O'Hagan (1995)]; and the vri(/3i,cr|y) being determined from or- 
dinary noninformative priors, typically the reference priors, which are either 
constant in the known variance case or given by 

(14) ^j(/3„a) = i 

(T 

in the unknown variance case. Of course, this may be incoherent from a 
Bayesian perspective, since it essentially means using different priors to de- 
termine the pi and the 7ri(/3i,(j|y). 

The result of following such a mixed strategy also allows non-Bayesians 
to connect with this methodology. In particular, the predictor that results 
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from use of the reference prior in model Mi is easily seen to be the usual 
least squares predictor, based on the least squares estimate 

3, = (x;xi)-ix;y. 

Thus, for instance, the common use of BIC together with least squares esti- 
mates can be converted into our setting by defining pi oc e . 

Finally, the empirical Bayes approach is often used to obtain estimated 
versions of the quantities in (11). Again, while not strictly coherent from a 
Bayesian perspective, one can utilize such inputs in the following method- 
ology. 

2.2. Predictors and predictive expected loss. It is easy to see that the 
optimal predictor of y* in (3), under squared error loss and when the model 
Ml, of dimension ki, is true, is given by 

yr = x*Hi3i, 

where /3i is the posterior mean of /3i with respect to T^\{f3i, crly) and Hi 
is the k X ki matrix whose {i,j) entry is 1 if /j = 1 and j = J2r=i^r and 
is otherwise. Note that Hi is simply the matrix such that xHi is the 
subvector of x corresponding to the nonzero coordinates of 1, that is, the 
covariate vector corresponding to model Mi. The posterior mean of f3 in 
the full model is thus formally written as Pn^,,,u, but we will drop the 

subscript and simply denote it by /3 (as we have done with /3, the least 
squares estimate for the full model). 

The optimal Bayesian predictor of y* is well known to be the model 
averaging predictor, given by 

(15) r = x*3^x*^piHi3i. 

1 

Note that "optimal" is defined in terms of expected loss over the posterior 
in (11); if the posterior arises from an incoherent prior, as is allowed above, 
there is no guarantee that the resulting procedure possesses any other opti- 
mality properties — even "dutch books" against the procedure or frequentist 
inadmissibility could result. 

The best single model for prediction can be found by the following lemma. 

Lemma 1. The optimal model for prediction of y* in (3) under the 
squared error loss (4), when the future covariates satisfy (5) and the pos- 
terior distribution is as in (11), is the model that minimizes 

(16) i?(Mi) = (Hi3,-^)'Q(Hi3,-^), 
where P is defined in {15). 
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Proof. For fixed x* a standard result [see, e.g., Bernardo and Smith 
(1994), page 398] is that 

where C does not depend on 1 and the expectation is with respect to the 
predictive distribution of y* given y. Since 

(yf - r? = (HiA - ^) W(H,A - 3), 

taking the expectation over x* and using (5) yields the result. D 

3. Basic results and examples. Section 3.1 presents the basic theorem 
that is used to establish optimality of the median probability model. Section 
3.2 considers the situation in which all submodels of the linear model are 
allowed. Section 3.3 deals with nested models and Section 3.4 considers the 
ANOVA situation. 

3.1. Basic theory. Assume X'X is diagonal. Then it will frequently be 
the case that the posterior means /3j satisfy 

(17) A = Hi;9, 

that is, that the posterior mean of /3i is found by simply taking the relevant 
coordinates of /3, the posterior mean in the full model. Here are two common 
scenarios in which (17) is true. 

Case 1. Noninformative priors for model parameters: Use of the ref- 
erence priors in (14) (or constant priors when a'^ is known) results in the 
posterior means being the least squares estimates Pi- Because X'X is diag- 
onal, it is easy to see that (17) is then satisfied. 

Case 2. Independent conjugate normal priors: In the full model suppose 
that 7r(/3|(T) is 7Vfc(/i,(T^A), the fc-variate normal distribution with mean // 
and diagonal covariance matrix cr^A, with A given. Then it is natural to 
choose the priors on I3y in the submodels to be A/a,', (Hj/x,cr^H[AHi), where 
ki is the dimension of f3^. It is then easy to verify that (17) holds for any 
prior on o"^ or for cj"^ being given (e.g., known, or estimated). Note that we 
do not necessarily recommend using this conjugate form of the prior with 
unknown o"^; see Berger and Pericchi (2001) for discussion. 

While A could be chosen subjectively, it is more common to utilize default 
choices, such as the g-type normal priors [cf. Zellner (1986)] A = n(X'X)~^ 
or A = c(X'X)~^, with c chosen by an empirical Bayes analysis (e.g., chosen 
to maximize the marginal density averaged over models). Papers which fall 
under these settings include Chipman, George and McCulloch (2001), Clyde 
and Parmigiani (1996), Clyde, DeSimone and Parmigiani (1996), Clyde, 
Parmigiani and Vidakovich (1998) and George and Foster (2000). 
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Note that one can use noninformative priors for certain coordinates and 
independent conjugate normal priors for other coordinates. This is particu- 
larly useful when all models under consideration have "common" unknown 
parameters; it is then typical to utilize noninformative priors for the common 
parameters, while using independent conjugate normal priors for the other 
parameters [see Berger, Pericchi and Varshavsky (1998) for justification of 
this practice]. 

Lemma 2. IfQis diagonal with diagonal elements qi> and {iTj holds, 
then 

k 
(18) R{M,)=Y,fi^qSi-Pi)\ 

i=l 

where pi is as in {6). 

Proof. From (17) it follows that 

P ^ ^PiHi^i = E^iHiH'i^ = D(p)^, 
1 1 

where D(p) is the diagonal matrix with diagonal elements pi. Likewise, (16) 
becomes 

i?(Mi) = (HiH'i^ - D(p)^)'Q(HiH;^ - D(p)3) 

= ^'(D(l)-D(p))Q(D(l)-D(p))^, 

and the conclusion is immediate. D 

Theorem 1. If CI is diagonal with diagonal elements qi > 0, condition 
{17) holds and the models under consideration have graphical model struc- 
ture, then the median probability model is the best predictive model. 

Proof. To minimize (18) among all possible models, it is clear that one 
should choose li = \ if Pi > 1/2 and /« = otherwise, which is as in (7). As 
mentioned earlier, the graphical model structure ensures that the model so 
defined is actually in the space of models under consideration, completing 
the proof. D 

The above theorem did not formally use the condition that X'X be di- 
agonal. However, if it is not diagonal, then (17) will not typically hold, nor 
will Q usually be diagonal. 
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3.2. All submodels scenario. Under the same conditions as in Section 
3.1 the following corollary to Theorem 1 gives the median probability model 
when all submodels are considered. 

Corollary 1. If Q is diagonal with diagonal elements Qi > 0, condi- 
tion (17) holds and any submodel of the full model is allowed, then the best 
predictive model is the median probability model given by (7). In addition, if 
(7^ is given in Case 2 of Section 3. 1 and the prior probabilities of the models 
satisfy 

k 

(19) P{M,) = \{{p',f{l-p^f'-^^\ 

where p^ is the prior probability that variable Xi is in the model, then {10) is 
satisfied and the median probability model is the model with highest posterior 
probability. 

Proof. The first part of the corollary is immediate, since {all submod- 
els} clearly has graphical model structure. 

For Case 2 and given cr^, computation as in Clyde and Parmigiani (1996) 
and Clyde, DeSimone and Parmigiani (1996) shows that (10) is satisfied 
with 

,-..:.(^^-i)a.«,-e^-%^|l^}, 

where the {d,} and {Aj} are the diagonal elements of X'X and A, respec- 
tively, and V = (ui, . . . , Vk)' = X'y- □ 

While many proposed choices of prior probabilities satisfy (19), others do 
not. For instance, Jeffreys (1961) suggested that it might often be reason- 
able to choose the prior probability of given model orders to be decreasing, 
for example, P(order j) oc 1/j, with this probability then being divided up 
equally among all models of size j. Such an assignment of prior probabilities 
would not typically satisfy (19), and the best predictive model (i.e., the me- 
dian probability model) would then not necessarily be the highest posterior 
probability model, even in Case 2. 

Finally, it should be noted that Corollary 1 also applies to the case where 
all models under consideration have "common parameters." One can simply 
define p^* = 1 for such parameters. 

3.3. Nested models. We initially consider the orthogonal case, as in Sec- 
tion 3.1. This is generalized to the nonorthogonal case at the end of the 
section. 
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3.3.1. Orthogonal case. 

Corollary 2. If Q is diagonal with diagonal elements qi > 0, condi- 
tion {iTj holds and the models under consideration are nested, then the best 
predictive model is the median probability model given by (7) or (9). 

Proof. This is immediate from Theorem 3.1, because nested models 
have graphical model structure. D 

Example 1. Nonparametric regression [also studied in Mukhopadhyay 
(2000) for a different purpose]: The data consists of the paired observations 
{xi,yi),i = 1, . . . ,n, where for known o"^, 

(20) yi = f{xi) + ei, e,~iV(0,cj2). 

Represent /(•), an unknown function defined on the interval (—1,1), using 
an orthonormal series expansion, as 



fi^) = '^PiMx), 



i=l 

where {(j)i{x),(l)2{x), . . . } are the Chebyshev polynomials. Of course, only a 
finite approximation to this series can be utilized, so define the model Muj^ 
[see (8) for notation] to be 

j 
M^j):y = J2l3Mx)+e, e^N{0,a^). 

The problem is thus to choose among the nested sequence of linear models 
Mi(j), J = 1, .... As is common in practice, we will choose an upper bound 
k on the size of the model, so that Min^\ is the full model in our earlier 
terminology. 

The function /(x) = — log(l — x) was considered in Shibata (1983) as an 
example for which BIG yielded considerably suboptimal models for predic- 
tion. It is hence of interest to see how the median probability model fares in 
this situation. 

We assume the yi are observed at the covariates Xi = Cosine([n — ^ + ^] ^) , 
i = l, . . . ,n, and let Xj = {(pmixi)) be the resulting nx j design matrix with 
indicated {i,m) entries, i = 1, . . . ,n and m = 1, . . . ,j. Prom the definition 
of Chebyshev polynomials it follows that X'Xj = ^Ij, where \j is the j x j 

identity matrix. It follows that the least squares estimate oi (3j is 0j = ^X'y. 

Assume that the models have equal prior probability 1/A;, and that, within 

any model Muj\ the f3i have independent A^(0,ct~") prior distributions for 

some constants c and a (which are the same across models). The choice of 
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a determines how quickly the prior variances of the (3i decrease (any L2 
function must have a > 1), and we shall consider three choices: a = 1, a = 2 
(which happens to be the rate corresponding to the test function) and a = 3. 
For simplicity of calculation we estimate c by an empirical Bayes analysis 
using the full model Muj^\ , keeping the estimate c fixed across models. Then 
if Q is diagonal (as would thus be the case for the natural choice Q = X^X^), 
Corollary 2 implies that the median probability model will be the optimal 
Bayesian predictive model. 

For nonparametric regression it is common to utilize the loss function 

(21) L{f,f)=j\f{x)-f{x)fdx. 

In the predictive context use of this loss is equivalent to prediction un- 
der squared error loss when the future covariates x are thought to be uni- 
formly distributed on (—1, 1). A standard computation shows that i(/, /) = 
X]i^i(/3i — A)^, where /?j stands for the estimator that is used for the true 
coefficient /?j. Since we have restricted the models under consideration to 
be of maximum order /c, it follows that /3j = for i> k va our problem and 
the loss arising from these coordinates can be ignored in model comparison. 
The resulting loss is also easily seen to be equivalent to the predictive loss 
we have previously used, with Q = I^. 

The optimality of the median probability model is with respect to the 
internal Bayesian computation, which assumes that the true function is ac- 
tually one of the models Afi(j), j = ^,- ■ ■ ,k- For the example considered 
by Shibata, however, the true model lies outside the collection of Bayesian 
models (since it can be shown that none of the Pi in the expansion for this 
function is zero). It is thus of interest to see how the median probability 
model performs in terms of the loss (21) for the (here known) function. 

Under M^^) the estimates of the Pi are the (empirical) Bayes estimates 

Pi = {1 + 2a'^i^/nc)~^Pi if i <k, and Pi = otherwise. Hence the predictive 
loss under model Mi(j) is J2i=i{l^i~ Pi)'^ ~^'l2'iZj+i Pf-, although we will ignore 
the terms in the sum for i> k since they are common across all considered 
models. 

In Table 1 we compare the expected predictive loss (the frequentist ex- 
pectation with respect to the data under the true function) of the maximum 
probability model with that of the median probability model. We also in- 
clude the model averaged estimate arising from (15) in the comparison; this 
is the optimal estimate from the internal Bayesian perspective. Finally, we 
also consider AIC and BIC. These model selection criteria are most com- 
monly used in conjunction with least squares parameter estimates, and that 
choice is made for computing the expected predictive losses in Table 1. 

The entries in the table were computed by simulating data from the true 
function, selecting a model, computing the corresponding function estimate. 
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and finally determining the actual loss. This was repeated a total of A^ = 
1000, 1000, 100 times, respectively, for the three cases in the table, with the 
resulting averages forming the table entries. Note that the largest model 
sizes considered for the three cases were k = 29, 79, 79, respectively. 

Our main goal was to compare the maximum probability model and the 
median probability model. The median probability model is clearly signif- 
icantly better, even in terms of this frequentist expected loss. (Again, we 
know it is better in terms of Bayesian predictive loss.) Indeed, the median 
probability model is almost as good as the model averaging estimate (and 
in two cases is even better); since model averaging is thought of as optimal, 
its near equivalence with MedianProb is compelling. 

Note that AIC does seem to do better than BIG, as reported by Shibata, 
but all of the actual Bayesian procedures are considerably better than either. 
This is in part due to the fact that the Bayesian procedures use better 
parameter estimates than least squares, but the dominance of MedianProb 
and ModelAv (but not MaxProb) can be seen to hold even if the superior 
shrinkage estimates are also used with BIG and AIG. 

3.3.2. Nonorthogonal case. Gorollary 2 presented a result for nested mod- 
els in the case of an orthogonal design matrix. It is not too surprising that 

Table 1 

For various n and a the expected loss and average model size for the 

maximum probability model (MaxProb), the median probability model 

(MedianProb), model averaging (ModelAv), and BIG and AIC, in the 

Shibata example 











Expected loss 


[average model size] 








MaxProb 


MedianProb 


ModelAv 


BIC 


AIC 


n - 


= 30,( 


7^ = 1 


















a - 


= 1 


0.99 


[81 


0.89 


[10] 


0.84 


1.14 


fs] 


1.09 


[7] 


a - 


= 2 


0.88 


[101 


0.80 


[16] 


0.81 


1.14 


[8] 


1.09 


[7] 


a - 


-3 


0.88 


[9] 


0.84 


[17] 


0.85 


1.14 


[8] 


1.09 


[7] 


n - 


= 100 


,-^ = 


1 
















a - 


= 1 


0.54 


[141 


0.51 


[19] 


0.47 


0.59 


[111 


0.59 


[131 


a - 


= 2 


0.47 


[231 


0.43 


[43] 


0.44 


0.59 


[111 


0.59 


[131 


a - 


= 3 


0.47 


[221 


0.46 


[45] 


0.46 


0.59 


[111 


0.59 


[131 


n - 


= 2000,cr2 = 


3 
















a - 


= 1 


0.34 


[231 


0.33 


[26] 


0.30 


0.41 


[121 


0.38 


[211 


a - 


= 2 


0.26 


[421 


0.25 


[51] 


0.25 


0.41 


[121 


0.38 


[211 


a - 


= 3 


0.29 


[381 


0.29 


[50] 


0.29 


0.41 


[121 


0.38 


[211 
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such results can be obtained under orthogonality. Quite surprising, however, 
is that the orthogonality condition can be removed under the following two 
conditions: 

Condition 1. Q = 7X'X for some 7 > 0. 

Condition 2. (3i = b(3i, where 6 > 0, that is, the posterior means are 
proportional to the least squares estimates, with the same proportionality 
constant across models. 

Note that Condition 2 is merely a special case of (17), so the generality 
here is in Condition 1, allowing for nondiagonal Q. (Of course, Q = 7X'X 
is diagonal under orthogonality.) 

There are two common scenarios in which Condition 2 is satisfied. The 
first is when the reference priors (14) are used, in which case the posterior 
means are the least squares estimates. The second is when using g-type nor- 
mal priors [cf. Zellner (1986)], where TTi{(3i\a) is A4i(0,C(T^(X'[Xi)^^), with 
the same constant c > for each model. (This constant could be specified or 
estimated in an empirical Bayesian context.) It is then easy to verify that 
Condition 2 holds with b = c/(l + c) (irrespective of the prior for a). 

Theorem 2. For a sequence of nested models for which Conditions 1 
and 2 hold, the best predictive model is the median probability model given 
by (7) or {9). 

Proof. From (16) and using Conditions 1 and 2, 
(22) i?(Mi(,)) = 762(Hi(,)3,y) - ^)'X'X(Hi(,)3,y) - ^). 

Noting that XHi^^) = ^Uj) and defining Pi = Xi(XjXi)~^Xj, it follows that 

i2(M,(,)) = 76V' f Pio) - T^pm^m) y- 

Note that P^ = Pi and Pi(j)Pi(j) = Pi(min{i.j})- Hence, expanding the quadratic 
in (22) yields 

(k \ ^ 

+ 7&V I Pi(i) - 2 Epi(i)Pi(,) - 2 Epi(i)Pi(j) J y. 

\ i=l i=j / 
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It follows that 

i?(M,(,+i))-i?(Mi(,))=7&2(l-2 ^ Pi(.))y(Pio+i)-Pio))y. 

Since y'(Pi(j-|-i) — Pi(j))y > and the (1 — 2X]i=j+iPi(i)) are increasing in j 
from —1 to +1, moving to a larger model will reduce the risk until (1 — 
2Si=7+iPi(i)) fii'st turns positive. The conclusion is immediate. D 

An example of a nested model in the nonorthogonal case will be given in 
Section 4. 

3.4. AN OVA. Many AN OVA problems, when written in linear model 
form, yield diagonal X'X and any such problems will naturally fit under the 
theory of Section 3.1. In particular, this is true for any balanced ANOVA in 
which each factor has only two levels. 

To see the idea, it suffices to consider the case of two factors A and B 
each with two levels. The full two-way ANOVA model with interactions is 

Vijk = f^ + ai + bj + abij + Eijk 

with z = 1, 2, j = 1, 2, k = 1,2, . . . ,K and e^k independent A^(0, o"^), with 
a'^ unknown. In our earlier notation, this can be written 

y = x/3 + £, 

where 

y = (yiii, • • • , yiiK,yi2i, • ■ • , vuk, y2ii, • • • , y2iK, 2/221, • • • , 2/22^)', 

f3 = {^,ai,hi,ahii)' 

and X is the 4:K x 4 matrix 

/I 1 1 1 \ 



X 



1 


1 


1 


1 


1 


1 


-1 


-1 


1 


1 


-1 


-1 


1 


-1 


1 


-1 


1 


-1 


1 


-1 


1 


-1 


-1 


1 



Vi -1-1 1 / 
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where the last column is the product of the second and the third, since 
ai = — a2, bi = —62, abii = 0622 = —abi2 = — afe2i- Computation then shows 
that X'X = 4Eri4, so that the earlier theory will apply. 

There are several model comparison scenarios of interest. We use a slight 
modification of the previous model notation for simplicity, for example, 
Mioii instead of M(]^o,i,i); representing the model having all parameters 
except ai. 

Scenario 1. All models with the constant ^: Thus the set of models 
under consideration is 

{Afiooo, Afiioo, Mioio, Miooi, Miioi, Mioii, Miiio, Mim}. 

Scenario 2. Interactions present only with main effects, and /i in- 
cluded: The set of models under consideration here is {Mioooi -^iioo; AfioiO; -^iiiO) 
-^1111 }• Note that this set of models has graphical structure. 

Scenario 3. An analogue of an unusual classical test: In classical ANOVA 
testing it is sometimes argued [cf. Scheffe (1959), pages 94 and 110] that one 
might be interested in testing for no interaction effect followed by testing for 
the main effects, even if the no-interaction test rejected. (It is argued that 
the hypotheses of zero main effects could still be accepted, which would 
imply that, while there are differences, the tests do not demonstrate any 
differences in the levels of one factor averaged over the levels of the other.) 
The four models that are under consideration in this process, including the 
constant // in all, are {Miioi,Mioii,Miiio,Miiii}. 

We do not comment upon the reasonableness of considering this class of 
models, but are interested in the class because it does not have graphical 
model structure and yet the median probability model is guaranteed to be in 
the class. To see this, consider the possibility that ai has posterior inclusion 
probability less than 1/2, and so would be excluded from the median proba- 
bility model. Clearly this can only happen if A/ion has posterior probability 
greater than 1/2; but then Mion would automatically be the median proba- 
bility model. Arguing similarly for bi and abn, one can conclude that Mim 
will be the median probability model, unless one of the other models has 
posterior probability greater than 1/2, in which case it will be the median 
probability model. 

Example 2. Montgomery [(1991), pages 271-274] considers the effects 
of the concentration of a reactant and the amount of a catalyst on the yield 
in a chemical process. The reactant concentration is factor A and has two 
levels, 15% and 25%. The catalyst is factor B, with the two levels "one bag" 
and "two bags" of catalyst. The experiment was replicated three times and 
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the data are given in Table 2. Note that the classical ANOVA tests of "no 
A effect," "no B effect" and "no interaction effect" resulted in p- values of 
0.00008, 0.00236 and 0.182, respectively. The Bayesian quantities that would 
be used analogously are the posterior variable inclusion probabilities, p2, Ps 
and p4. (Strictly, l—pi would be the analogue of the corresponding p- value.) 
To carry out the Bayesian analysis, the reference prior 7r{fi,a) oc - was 
used for the common parameters, while the standard A^(0,o"^) ^f-prior was 
used for ai, 6i and abu. In each scenario the models under consideration 
were given equal prior probabilities of being true. The conditions of Section 
3.1 are then satisfied, so that we know that the median probability model will 
be the optimal predictive model. For the three scenarios described above the 
results of the Bayesian analysis are given in Tables 3-5. In all three scenarios 
the median probability model indeed has the lowest posterior expected loss 
(as was known from the theory) . Interestingly, the median probability model 
equals the maximum probability model in all three scenarios and is the 
model MiiiQ. The variable inclusion probabilities show clearly that an "A 

Table 2 
Data for the 2^ ANOVA example 



Treatment Replicates 



combination 


I 


II 


III 


A low, B low 


28 


25 


27 


A high, B low 


36 


32 


32 


A low, B high 


18 


19 


23 


A high, B high 


31 


30 


29 



Table 3 

Scenario 1 (all models). Posterior probabilities and 

expected losses for the models. The posterior inclusion 

probabilities are P2 = 0.9977, ps = 0.9514 and 

P4 = 0.3621; thus Mmo is the median probability 

model 





Posterior 


Posterior 


Model 


probability 


expected loss 


Afiooo 


0.0008 


235.47 


Miioo 


0.0342 


58.78 


Mioio 


0.0009 


177.78 


Afiooi 


0.0003 


237.43 


Miiio 


0.6019 


1.083 


Miioi 


0.0133 


60.73 


Mioii 


0.0003 


179.74 


Mini 


0.3483 


3.04 
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Table 4 

Scenario 2 (graphical models). Posterior probabilities 

and expected losses for the models. The posterior 

inclusion probabilities are p2 = 0.9982, ps = 0.9644 

and P4, = 0.3532; thus Mmo is the median probability 

model 





Posterior 


Posterior 


Model 


probability 


expected loss 


Miooo 


0.0009 


237.21 


Miioo 


0.0347 


60.33 


Mioio 


0.0009 


177.85 


Miiio 


0.6103 


0.97 


A/iiii 


0.3532 


3.05 



Table 5 

Scenario 3 (unusual classical models). Posterior 

probabilities and expected losses for the models. The 

posterior inclusion probabilities are p2 — 0.9997, 

P3 = 0.9862 and pi — 0.3754; thus Mmo is the median 

probability model 





Posterior 


Posterior 


Model 


probability 


expected loss 


Mioii 


0.0003 


180.19 


Ml 101 


0.0138 


64.93 


Mmo 


0.6245 


1.01 


Mini 


0.3614 


2.78 



effect" and a "B effect" should be in the model (with inclusion probabilities 
exceeding 0.99 and 0.95, respectively), while the interaction effect has a 
moderately small probability (about 1/3) of being in the model. 

We also carried out an analysis with the N{0,ca'^) g-pvior for ai, bi and 
abii, but with c being estimated by maximizing the overall marginal den- 
sity 7;Z]i"^i(y)) where the individual marginal densities rrai(y) are given by 
(13) and L is the number of models under consideration. The conditions of 
Section 3.1 are still satisfied, so that we know that the median probability 
model will be the optimal predictive model. The results did not significantly 
change from the above analysis, however, and so are not reported. 

Had 0"^ been known in Scenario 1, Corollary 1 would have established that 
the median probability model would equal the maximum probability model. 
Here o"^ is unknown, however, and it will not always be the case that the 
median probability model equals the maximum probability model. Indeed, 
we also carried out the analysis of the example utilizing the A^(0, o"^) g-prior 
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Table 6 

Scenario 3 (unusual classical models, with g -prior for ^). 

Posterior probabilities and expected losses for the models. The 

posterior inclusion probabilities are p2 = 0.876, ps — 0.714 and 

Pi = 0.544,- thus Mini is the median probability model 





Posterior 


Posterior 


Model 


probability 


expected loss 


A^ioii 


0.124 


143.03 


Miioi 


0.286 


36.78 


Miiio 


0.456 


10.03 


Mini 


0.134 


9.41 



for ft, as well as for ai, hi and abn, and found that the median probabil- 
ity model then differed from the maximum probability model in all three 
scenarios. Table 6 gives the results for Scenario 3; note that the median 
probability model is nearly the lowest probability model! (We do not, how- 
ever, recommend this analysis; g-priors should not be used for parameters 
common to all models.) 

4. Common nonorthogonal nuisance parameters. Frequently all models 
will contain "common" parameters /3^) = (/?i, . . . ,/3fc^). A typical example 
is when all models contain an overall mean f5i [or, equivalently, when the 
first column of each model design matrix is (1, . . . , 1)']. For the orthogonal 
case discussed earlier this caused no difficulties. For the nonorthogonal case, 
however, this considerably complicates the analysis. Still, we will see that 
the median probability model remains optimal under mild modifications of 
the previously considered conditions. 

To present the results, it is convenient to slightly change notation, writing 
the regression parameters of M\ as (/3m\,/3i)', with corresponding design 
matrix (Xn)Xi). Also, define 

.<,o^ Q(l) =I-X(i)(X'(i)X(i))-^X'(i), 

^ ^ Pi = Q;/)^X.(XiQ(,)XO-^XiQ;/)^ 

The necessary conditions are: 

Condition 1. Q = 7X'X for some 7 > 0. 

Condition 2. For some fixed 6 > the posterior means of (3y are of the 
form ^1 = 6(X;Q(i)Xi)-ix;Q(i)y. 

Condition 3. 7ri(/3(i),/3i|(7) = TTi{(3]\a) (i.e., the prior density for /3(]^) 
in each model is constant). 
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A g-type prior for which Condition 2 holds is 

7ri(/3i|a)=AA(0,ca2(x;Q(i)Xi)-^), 

with the same constant c > for each model. It is then easy to verify that 
Condition 2 holds with b = c/(l + c) (irrespective of the prior on a). Note 
that if XC-^nXi = 0, then XjQ(i)Xi = XjXi, so this would be a standard 
ff-type prior. 

Theorem 3. Under Conditions 1-3 the best predictive model under 
squared error loss minimizes 

(24) R{Mi) = C + -fbW I Pi - 2 ^pi*Pi.i. j w, 

where w = Q^'^'^y, C is a constant and 1-1* is the dot-product of I and 1*. 

Proof. Write x* = (x^^xL-,) and X = (X(i),X(2)), and define U = 
(X^nX(;^))~^X'.j^n and Vi = (X[Q(;^)Xi)~^X[Q(;^). Note that the noncommon 
variables in Mi are xLnHij, where Hij is the matrix consisting of the rows 
of Hi from ki + 1 to k. With this notation note that 

(25) yi* = x^i)^(i)+x^2)Hi2A- 
Using Condition 3, it is straightforward to show that 

E[/3(i)|y,/3i] = U(y-Xi/3,), 
so that 

%)=E[/3(i)|y]=U(y-XiE[/3i|y]). 
Using this in (25), together with Condition 2, yields 

y{ = x^i)U(I - 6XiVi)y + 6x^2)Hi,Viy 

'U(I-6XiVi) 
Defining 



" I 6H.V, ly- 



it follows (using Condition 1 in the third equality) that 
R{Mi) = B^'[y{-rf 



(26) 



■ b^B'' 



x*(Wi-^Pl.WiJy 
7^V(wi-^pi.Wi.j X'xfwi-^pi.WiJy. 
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Note that 

XWi = (X(i),X(2))(^"^^^')v, 

= -X(i)UXiVi + XiVi = qJ/)'piqJ/)'. 



Together with (26) this yields 

(27) 

the last step utilizing (23) and the fact that Qn) is idempotent. 

1/2 

Because Pi is the projection onto the columns of Qn) Xi that correspond 

to the nonzero elements of 1, it is clear that P^ = Pi and PiPp = Pm*- 

Expanding the quadratic in (27) with 



i?(Mi) = TfeV'Qj/)' fpi - Epi*Pi* ) Q(i) (pi - E^i-Pi* )y 

76VMPi-5^Pi.PiJ 



C = 7& V 



, 1* / 



yields the result. D 



Corollary 3 (Semi-orthogonal case). Suppose Conditions 1-3 hold 
and that X/2-,Q(i)X(2) is diagonal with positive entries, where the full design 
matrix is X = (Xn)X(2)). Then, if the class of models under consideration 
has graphical model structure, the best predictive model is the median prob- 
ability model given by (7). 

Proof. Writing X/2-,Q(i)X(2) = D(d), the diagonal matrix with diago- 
nal elements di > 0, Pi can be expressed as 

Pi = Q(^^ X(2)Hi2(Hj2X(2)Q(i)X(2)Hi2) 'H.l^X'(^2)Q{i) 
= Q(^) X(2)Hi2(Hi2D(d)Hi2) 'H.[,^X'(^2)^{i) 
= Q;/)%)(D(d.l))-iX'(2)Qj/'. 

1 /2 

Hence, defining u = X',2^Q^^ w, (24) becomes 

2„/ 



R{Mi) = C + -fb^u 



D(d-l)-i-2^pi.D(d.l-r)-i 
1* 

c+jb'j:n^d.%(i-2 y: p,X 



U 
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and the conclusion is immediate. D 

Note that X/2)Q(i)X/2) wih be diagonal if either (i) X'X is diagonal or 
(ii) X/2nX(2) is diagonal and X/-^nX(2) = 0. 

Corollary 4 (Nested case). Suppose Conditions 1-3 hold and that the 
-^i(i)) J = 0, . . . , fc, are a nested sequence of models. Then the best predictive 
model is the median probability model given by (7) or (9). 

Proof. For the nested case (24) becomes 
(28) i?(Mi(,)) = C + 7?'V(Pi(,) -2 5]pi(i)Pi(i) -2f]pi(,)Pi(,) )w. 

\ 1=0 i=j I 

It follows that 

i?(Miy+i))-fl(M,(,.)) 



762 1 - 2 ^ ^^^^ w'(Pi(^+i) - Pi(j))w. 

\ j=7+l / 



Since w'(Pi(j_|_i) — Pi(j))w > and the (1 — 2X)i=j+iPi(i)) are increasing 
in j from —1 to +1, moving to a larger model will reduce the risk until 
(1 — 2X]j=j+iPi{j)) first turns positive. The conclusion is immediate. D 

Example 3. Consider Hald's regression data [Draper and Smith (1981)], 
consisting of n = 13 observations on a dependent variable y with four po- 
tential regressors: rEi,X2,X3, 2:4. Suppose that the following nested models, 
all including a constant term c, are under consideration: 

Mi(i): {0,2:4}, Mi(2):{c,j;i,a;4}, 

^1(3) : {c, xx.xz, X4}, Mi(4) : {c, xi, X2, 3:3, a;4}, 

again using the notation in (8). We choose the reference prior (14) for the 
parameters of each model, which effectively means we are using least squares 
estimates for the predictions and ensures that Conditions 2 and 3 are satis- 
fied. (Here, the models have two common parameters, the constant term and 
the parameter corresponding to variable 0:4.) Choosing Q = X'X, it follows 
that the posterior predictive loss of each model is given by (24). 

Two choices of model prior probabilities are considered, P(hl\u\) = 1/4, i = 
1,2,3,4, and P*(Mi(j)) =i~^ /Y^'^j=i3~^ [th^ latter type of choice being dis- 
cussed in, e.g., Jeffreys (1961)]. Default posterior probabilities of each model 
are then obtained using the Encompassing Arithmetic Intrinsic Bayes Fac- 
tor, recommended in Berger and Pericchi (1996a, b) for linear models. The 
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resulting model posterior probabilities, P(Mi(j)|y) and P*(Mi(jJy), for the 
two choices of prior probabilities, respectively, are given in Table 7. The 
table also presents the normalized posterior predictive loss i?(Mi(-j)) — C for 
each model. 

Since these models are nested. Corollary 4 ensures that the median prob- 
ability model is the optimal predictive model. Using (9), it is clear from 
Table 7 that Mi(3) is the median probability model for both choices of prior 
probabilities. And, indeed, the posterior predictive loss of M^^-^ is the small- 
est. Note that Mu^,) is the maximum probability model for the first choice 
of prior probabilities, while Mu2) (which is suboptimal) is the maximum 
probability model for the second choice. 

5. A geometric formulation. It was stated in the Introduction that, in 
general, knowing only the model posterior probabilities does not allow one 
to determine the optimal predictive model. This is best seen by looking 
at the problem from a geometric perspective which, furthermore, provides 
considerable insight into the problem. 

Assuming the matrix Q in (5) is nonsingular and positive definite, con- 
sider its Cholesky decomposition Q = A'A, where A is a fc x A: upper trian- 
gular matrix. The expected posterior loss (16) to be minimized can then be 
written as 

(29) R{Mx) = {on-a.)'{on-a), 

where cx\ = AHi/3i is a A:-dimensional vector and a = A/3 = ^[PiAHi/3i. [If 
Q = X'X and (5\ = (3\, one can define a\ as cx\ = XHi/3i = Xi(X'[Xi)~^ Xjy, 
the projection of y on the space spanned by the columns of Xi.] It follows 
that the preferred model will be the one whose corresponding cx\ is nearest 
to a. in terms of Euclidean distance. 

The geometric formulation of the predictive problem follows by represent- 
ing each model M\ by the point cx\. The collection of models thus becomes a 
collection of points in /s-dimensional space. The convex hull of these points 
is a polygon representing the set of possible model averaged estimates a as 

Table 7 
Posterior probabilities and predictive losses for Hald's data 





Mi(i) 


M,(2) 


M,(3) 


Mi(4) 


^(Mwly) 

i?(M,(,))-C 


0.0002 



0.3396 

-808.81 


0.5040 
-816.47 


0.1562 
-814.43 


7?(M,(,))-C 


0.0005 



0.4504 
-808.32 


0.4455 
-810.67 


0.1036 
-808.31 
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the p\ vary over their range. Hence any point in this polygon is a possible 
optimal predictive model, depending on the p\, and the goal is to geomet- 
rically characterize when each single model is optimal, given that a single 
model must be used. 

Consider the simple situation in which we have two covariates xi and X2 
and three possible models: 

Mio:{xi}, Moi:{x2}, Afn : {xi,X2}, 

again writing, for example, Mqi instead of M(o,i). These can be represented 
as three points in the plane. (If the three models had a constant, or intercept, 
term, then the three points would lie on a plane in three-dimensional space, 
and the situation would be essentially the same.) 

Depending on the sample correlation structure, the triangle whose vertices 
are Qqi, ctio ^-iid ctii can have three interesting distinct forms. These three 
forms are plotted in Figure 1. Subregions within each plot will be denoted 
by the vertices; thus, in Figure 1(a) [aoi,-^, C] denotes the triangle whose 
vertices are olqi-, F and C. 

Each triangle can be divided into optimality subregions, namely the set 
of those a which are closest to one of the a^. These are the regions defined 
by the solid lines. Thus, in Figure 1(a), the triangle [aiQ,F,C] defines those 
points that are closer to ctio than to the other two vertices; hence, if a were 
to fall in this region the optimal single model would be Miq. If a were to fall 
in the triangle [aQi,B,E] the optimal single model would be Mqi and, if a 
were to fall in the region between the two solid lines the optimal single model 
would be Mil. It is easy to see that these optimality regions are formed by 
either (i) connecting the perpendicular bisectors of the sides of the triangle, 
if all angles are less than or equal to 90°, or (ii) drawing the perpendicular 
bisectors of the adjacent side of an angle that is greater than 90°. 

In each plot. A, B and C are the midpoints of the line segments qiqcioi, 
ctQictii and oliqoliI, respectively, while O is the midpoint of the triangle. 
These are of interest because they define regions such that, if a lies in the 
region, then the model corresponding to the vertex in the region has the 
largest posterior probability. Thus, in Figure 1(a), if o. lies in the polygon 
[q;io,^,0,C], then Miq must be the maximum posterior probability model. 

Note that the maximum posterior probability regions do not coincide 
with the optimal predictive model regions. As a dramatic illustration of the 
difference, consider Figure 1(a) and suppose that a lies on the line segment 
EF. Then Mn is the optimal predictive model, even though it has posterior 
probability 0. Also, either Miq or Mqi has posterior probability at least 1/2 
on this line segment, yet neither is the best predictive model. 

The dashed lines form the boundaries defining the median probability 
models. Thus, if o; lies in the triangle [ctio,^, C], then Miq will be the 
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Fig. 1. Three possible scenarios for the graphical representation of predictive model se- 
lection from among Mu) '■ {x\}, Moi : {x2} and A'h\ : {x\,X2\- 

median probability model, while if a lies in the polygon [C^A,B,cx\i\, then 
Mil will be the median probability model. To see why this is so, note that 
the line segment AC consists of the points for which P(Mio|y) = 1/2 (i.e., 
for which Miq has posterior probability 1/2). But then clearly the inclusion 
probability for variable X2 is also equal to 1/2 on this line segment, since 
P2 = P(Moi|y) + P(Mii|y) = 1 - P(Mio|y). Similarly, AB consists of the 
points for which the inclusion probability for variable xi is equal to 1/2. 
It is immediate that the median probability model in (7) is defined by the 
indicated regions. 

Figures 1(a) and (b) thus show that the median probability model will 
not always equal the optimal predictive model. Indeed, the two are the same 
only in the situation of Figure 1(c). In a sense, the theory in the preceding 
sections arose out of efforts to characterize situations in which the predictive 



Table 8 
Posterior probabilities and posterior expected losses for Hald^s data 



Model 
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Model 


P{M,\y) 


R{Mi) 
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0.000003 


2652.44 


c,2,3 


0.000229 


353.72 


c, 1 


0.000012 


1207.04 


c,2,4 


0.000018 


821.15 


c,2 


0.000026 


854.85 


c,3,4 


0.003785 


118.59 


c,3 


0.000002 


1864.41 


c,l,2,3 


0.170990 


1.21 


c,4 


0.000058 


838.20 


c,l,2,4 


0.190720 


0.18 


c,l,2 


0.275484 


8.19 


c,l,3,4 


0.159959 


1.71 


c,l,3 


0.000006 


1174.14 


c,2,3,4 


0.041323 


20.42 


c,l,4 


0.107798 


29.73 


c, 1,2, 3, 4 


0.049587 


0.47 
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risk representation would be as in Figure 1(c). We found that this is so if 
X'X is diagonal and (17) holds, as in Section 3.1. We also found this to be 
true in the nested model case discussed in Section 3.3. [Indeed, the resulting 
figure is simply a rotated version of Figure 1(c).] Subsequently we were able 
to develop the more general algebraic theories in those sections, but they 
were based on insights obtained through the geometric formulation. 

One can seek alternative theories based on observations in the geometric 
formulation. For instance, notice that if the triangle in Figure 1(b) were 
equilateral, then O and G would coincide and the maximum probability 
model would equal the optimal predictive model. Unfortunately, we could 
not find any useful general conditions under which the triangle would be 
equilateral. 

6. Concluding comments. 

6.1. When the theory does not apply. The conditions of the optimality 
theory for the median probability model are quite strong and will often not 
apply. Nevertheless, the fact that only the median probability model seems to 
have any optimality theory whatsoever suggests that it might quite generally 
be successful, even when the optimality theory does not apply. 

Example 3 (Continued). Suppose that all models (including at least the 
constant term) are considered for Hald's data. This does not formally satisfy 
the theory in Section 4, since the models are not nested and the conditions 
of Theorem 3 do not apply. But here the situation is simple enough that we 
can directly compute the posterior predictive losses corresponding to each 
of the possible models, using (16) and assuming equal prior probabilities of 
the models. The results are given in Table 8. 

Computation of the posterior inclusion probabilities yields 
Pi= Y. ^(Mi|y) =0.954556, P2= Y. ^(Mi|y) = 0.728377, 

l:«l = l 1:^2 = 1 

Pz= Y ^(Mi|y) =0.425881, Pi= Y ^(Mi|y) = 0.553248. 

1:/3 = 1 l:i4 = l 

Thus the median probability model is {c, xi,X2,X4}, which from Table 8 
clearly coincides with the optimal predictive model. Note that the risk of 
the maximum probability model {c,xi.,X2} is considerably higher than that 
of the median probability model. 

This example is typical; in our experience the median probability model 
considerably outperforms the maximum probability model in terms of pre- 
dictive performance. At the very least this suggests that the median prob- 
ability model should routinely be determined and reported along with the 
maximum probability model. 
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6.2. When does the median probability model fail? Suppose that the only 
models entertained are those with a constant term and a single covariate Xi, 
i = l, . . . ,k, with k>3, as well as the model with only a constant term. All 
models have equal prior probability of l/{k + 1). Furthermore, suppose that 
all covariates are nearly perfectly correlated with each other and with y. 
Then the posterior probability of the constant model will be near zero, and 
that of each of the other models will coincide with the posterior inclusion 
probabilities of each of the Xj, and will be approximately 1/k. Since these 
posterior inclusion probabilities are less than 1/2, the median probability 
model will be the constant model, which will have very poor predictive 
performance compared to any of the other models. 

One might be tempted to conclude from this that the median probability 
model might be problematical if there are highly correlated covariates. We 
have not yet observed such a difficulty in practice, however. Indeed, the 
Hald example given in the previous section is an example in which there is 
high correlation between covariates, yet we saw that the median probability 
model was still the best. 

6.3. Use of posterior inclusion probabilities. In addition to being key to 
defining the median probability model, the posterior inclusion probabilities 
in (6) can be important tools in assessing the effect of covariates, as indicated 
in the ANOVA example. One can, furthermore, define joint posterior inclu- 
sion probabilities of covariates; these can be very useful in unraveling the 
effects on model selection of correlations among covariates. See Nadal (1999) 
for examples. Finally, posterior inclusion probabilities are a key element in 
some of the most effective search strategies in model space; compare Berger 
and Molina (2002). The importance of posterior inclusion probabilities was 
emphasized in Mitchell and Beauchamp (1988). 

Acknowledgments. The authors are grateful to Merlise Clyde for sug- 
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